feat: ConvoMem sampled adapter with range-probe selective fetch #21
Merged
ConvoMem (Salesforce, Apache-2.0, ~75K QA pairs) ships as pre-mixed test cases (self-contained conversation haystacks plus questions) that map 1:1 onto the grouped runner mode.

- datasets/convomem.py: the full dataset is multi-GB (one size-300 batch is ~850MB), so fetching is selective. Batch files within each <category>/<N>_evidence/ dir are ordered by case context size, and an HTTP Range tail-probe (last 4KB) reads each file's final contextSize without downloading it. Only files matching the requested sizes are fetched; index.json records every probe result, including files NOT downloaded, so the selection itself is auditable. Probes are throttled with retries (HF CDN resets rapid bursts).
- converters/convomem_to_corpus.py: stratified deterministic sampling by (category, contextSize) with a fixed seed; sampling.json records seed, per-stratum population, and sample counts so a published number states exactly which slice of ConvoMem it covers. Leakage scrub: containsEvidence/model_name dropped, conversation ids remapped to neutral positional ids. Ground truth maps evidence conversation ids through the remap; abstention evidence referencing absent conversations yields empty ground truth by design.
- CLI: datasets fetch --dataset convomem --context-sizes; convert convomem --sample-per-stratum/--seed/--context-sizes. justfile recipes + README section.

Live-verified against the real dataset (user_evidence, size 10): 50 files probed, exactly 2 downloaded, 3 cases sampled -> 30 docs / 30 queries (size-10 cases pack 10 questions per haystack, each targeting a distinct evidence conversation), all ground truth non-empty and remapped.

7 new unit tests; suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
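For readers unfamiliar with the tail-probe idea, here is a minimal sketch of how such a probe could work. It is not the code in datasets/convomem.py. The function names, the retry policy, and the assumption that `contextSize` appears as a numeric JSON field near the end of each batch file are all illustrative.

```python
import re
import time
import urllib.request

# Assumption: each case in a batch file carries a numeric "contextSize" JSON field.
_CONTEXT_SIZE = re.compile(rb'"contextSize"\s*:\s*(\d+)')


def last_context_size(tail: bytes):
    """Return the last contextSize value found in a chunk of bytes, or None."""
    matches = _CONTEXT_SIZE.findall(tail)
    return int(matches[-1]) if matches else None


def probe_tail(url, nbytes=4096, retries=3, delay=1.0):
    """Fetch only the last `nbytes` of `url` using an HTTP suffix Range request.

    Note: a server that ignores Range would answer 200 with the full body;
    a real implementation should check for a 206 status before reading.
    """
    request = urllib.request.Request(url, headers={"Range": f"bytes=-{nbytes}"})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except OSError:  # covers URLError and connection resets
            if attempt == retries - 1:
                raise
            time.sleep(delay * (attempt + 1))  # back off before the next try
```

Because files are ordered by context size, the last value in the tail is enough to decide whether a file can contain cases of a requested size, so `last_context_size(probe_tail(url))` would be the per-file decision input under these assumptions.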
Summary
Adds the last Phase-1 benchmark: ConvoMem (Salesforce, Apache-2.0, ~75K QA pairs), as a documented stratified sample. Pre-mixed test cases map 1:1 onto grouped runner mode.
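As an illustration of what a documented stratified sample can look like, the sketch below groups cases by (category, contextSize), draws a fixed number per stratum with a seeded generator, and returns a report alongside the sample. This is a hedged sketch, not the converter's actual code: the `category` key on each case, the function name, and the report layout are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_sample(cases, per_stratum, seed):
    """Sample up to `per_stratum` cases from each (category, contextSize) stratum.

    Returns (sample, report). The report layout is hypothetical; it only
    mirrors the kind of information sampling.json is described as holding.
    """
    strata = defaultdict(list)
    for case in cases:
        strata[(case["category"], case["contextSize"])].append(case)

    rng = random.Random(seed)  # fixed seed: same input order gives same sample
    sample, report = [], {}
    for key in sorted(strata):  # stable stratum order keeps draws reproducible
        population = strata[key]
        k = min(per_stratum, len(population))
        sample.extend(rng.sample(population, k))
        report[f"{key[0]}/{key[1]}"] = {"population": len(population), "sampled": k}
    return sample, {"seed": seed, "strata": report}
```

Iterating strata in sorted order matters here: the generator's state advances with every draw, so a stable order is what makes the result repeatable for a given seed and input.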
Design
- An HTTP Range tail-probe (last 4KB) reads each batch file's final contextSize without downloading it. Only matching files are fetched. index.json records all probe results, including files not downloaded, so the selection is auditable. Probes are throttled and retried (HF CDN resets rapid bursts; hit and fixed live).
- Stratified deterministic sampling by (category, contextSize) with a fixed seed. sampling.json records seed + per-stratum population/sample counts. A published number states exactly which slice it covers.
- Leakage scrub: containsEvidence/model_name scrubbed, conversation ids remapped to neutral positional ids (covered by test).
Verification
🤖 Generated with Claude Code